Record Location and Reconfiguration in Unstructured Multiple-Record Web Documents

Authors

David W. Embley

Li Xu

Abstract

Record extraction from data-rich, unstructured, multiple-record Web documents works well [8], but only if the text for each record can be located and isolated. Although some multiple-record Web documents present records as contiguous, delineated chunks of text (which can thus be located and isolated [9]), many do not. When some values of textual records are factored out, are split unnaturally across boundaries, are joined unnaturally within boundaries, or are linked by off-page connectors, or when desired records are interspersed with records that are not of interest, it is difficult to automatically cull records and piece values together to form clean, delineated chunks of text that each represent a single record of interest. In this paper we attack this problem and propose an algorithm to find and rearrange (if necessary) records in an HTML document by attempting to maximize a record-recognition heuristic with respect to a given application ontology. Tests we conducted show that this technique properly locates and reconfigures records for all classified types of rearrangements both for artificial and for actual multiple-record Web documents.

Similar resources

Locating and Reconfiguring Records in Unstructured Multiple-Record Web Documents

Record extraction from data-rich, unstructured, multiple-record Web documents works well [9], but only if the text for each record can be located and isolated. Although some multiple-record Web documents present records as contiguous, delineated chunks of text (which can thus be located and isolated [10]), many do not. When some values of textual records are factored out, are split unnaturally a...

Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages

Electronically available data on the Web is exploding at an ever increasing pace. Much of this data is unstructured, which makes searching hard and traditional database querying impossible. Many Web documents, however, contain an abundance of recognizable constants that together describe the essence of a document’s content. For these kinds of data-rich, multiple-record documents (e.g. advertise...

Recognizing Ontology-Applicable Multiple-Record Web Documents

Automatically recognizing which Web documents are “of interest” for some specified application is non-trivial. As a step toward solving this problem, we propose a technique for recognizing which multiple-record Web documents apply to an ontologically specified application. Given the values and kinds of values recognized by an ontological specification in an unstructured Web document, we apply t...

Adaptive Approximate Record Matching

Typographical data entry errors and incomplete documents produce imperfect records in real-world databases. These errors generate distinct records which belong to the same entity. The aim of Approximate Record Matching is to find multiple records which belong to an entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...

Geographic Focus Detection for Web Documents using Multiple Location Taggers

Being able to identify locations associated with a Web resource is essential for providing location-based Web applications. However, geographical information in Web documents is rarely supplied in a machine-readable way and therefore not easily discoverable. As a consequence, it is necessary to extract geographical keywords from Web documents and to associate locations with them. This method is c...

Publication date: 2000
